Derrick Murphy

CSC 4322

Parallel Computing

1. **D**

my\_first = my\_rank \* (n/p)

my\_last = my\_rank \* (n/p) + (n/p) - 1
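
As a minimal sketch, the two formulas above can be written as a small C helper. The names `block_range` and `block_partition` are hypothetical, and the sketch assumes that `n` is evenly divisible by `p`, which the formulas also appear to assume.

```c
#include <assert.h>

/* Inclusive index range owned by one process under a block partition. */
typedef struct {
    int first;  /* index of the first element owned by this rank */
    int last;   /* index of the last element owned by this rank */
} block_range;

/* Block partition of n elements over p processes.
   Assumption: p > 0 and n is evenly divisible by p. */
block_range block_partition(int my_rank, int n, int p) {
    assert(p > 0 && n % p == 0);
    assert(my_rank >= 0 && my_rank < p);
    int chunk = n / p;               /* elements per process */
    block_range r;
    r.first = my_rank * chunk;       /* my_first = my_rank * (n/p) */
    r.last = r.first + chunk - 1;    /* my_last = my_first + (n/p) - 1 */
    return r;
}
```

For example, with `n = 12` and `p = 4`, rank 2 would own indices 6 through 8.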

* 1. Int receives;

…….

}else{

Send my\_x to the master;

receives++

* 1. Int receives;

…….

}else{

Send my\_x to the master;

receives++

* 1. Receive, because the receiver must constantly check for incoming data.
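
The master/worker pseudocode above elides most of its body, so the following is only a serial simulation of the general pattern, not a reconstruction of the elided code. It assumes that rank 0 is the master and that every other rank sends exactly one value; it counts receives on the master side. The function name `master_collect` is hypothetical, and no real message passing takes place.

```c
#include <assert.h>

/* Serial simulation (no real message passing): rank 0 acts as the master
   and "receives" one value from each of the other p - 1 ranks.
   Returns the number of receives; the total is written to *sum. */
int master_collect(const int *my_x, int p, int *sum) {
    assert(p > 0);
    int receives = 0;
    *sum = my_x[0];                  /* the master's own value */
    for (int rank = 1; rank < p; rank++) {
        *sum += my_x[rank];          /* stands in for receiving my_x from rank */
        receives++;
    }
    return receives;
}
```

Under these assumptions the master performs `p - 1` receives, one per worker.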

**Write Back**

Pros:

* Low latency and high throughput for write-intensive applications.
* Best performer for mixed workloads as both read and write I/O have similar response time levels.

Cons:

* Data availability exposure risk because the only copy of the written data is in cache.

**Write Through**

Pros:

* Data updates are safely stored in the backing store, for example writes to a shared array.
* Good for applications that write and then re-read data frequently.

Cons:

* Write I/O incurs the latency of writing to the backing storage.
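
The difference between the two write policies above can be illustrated with a toy model of one cached word in front of one word of backing memory. This is a sketch under simplifying assumptions, not a model of any real cache; the type `cache_line` and the functions `write_through`, `write_back`, and `flush` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/* One cached word in front of one word of backing memory. */
typedef struct {
    int value;    /* copy held in the cache */
    bool dirty;   /* true when the cache holds data that memory has not seen */
} cache_line;

/* Write-through: update the cache and the backing memory together. */
void write_through(cache_line *line, int *memory, int v) {
    assert(line && memory);
    line->value = v;
    *memory = v;
    line->dirty = false;
}

/* Write-back: update only the cache and mark the line dirty. */
void write_back(cache_line *line, int v) {
    assert(line);
    line->value = v;
    line->dirty = true;
}

/* Flush (for example on eviction): copy a dirty line to memory. */
void flush(cache_line *line, int *memory) {
    assert(line && memory);
    if (line->dirty) {
        *memory = line->value;
        line->dirty = false;
    }
}
```

In this model, a value written with `write_back` exists only in the cache until `flush` runs, which is the exposure risk listed under Write Back, while `write_through` pays the cost of the memory write on every call.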

**Shared Memory**

Pros:

* Good speed once access is granted.
* Easier to program

Cons:

* Limited scalability, and not suited to every purpose.
* Differences in memory latencies are exposed to the programmer.

**Distributed Memory**

Pros:

* Cost-effective way to scale memory bandwidth
* Reduces latency of local memory accesses

Cons:

* Must change software to take advantage of increased memory bandwidth

**IBM Power8**

Up to 230 GB/s sustained bandwidth

Cores:

12 cores (SMT8)

64 KB data cache

32 KB instruction cache

Clock Speed: 4 GHz

Cache Hierarchy:

512 KB SRAM L2 / core

96 MB eDRAM shared L3

Up to 128 MB eDRAM L4 (off-chip)

The Coherent Accelerator Processor Interface (CAPI) port allows accelerators, GPUs, flash memory, networking devices, and FPGAs to connect directly to the processor and share the same address space, which increases performance and decreases latency.

**IBM BlueGene/Q**

209 TF/rack

Cores:

A2 cores, 16-core/64-thread SoC (4-way hardware threaded)

Quad floating-point unit on each core (204.8 GF peak per node)

Clock Speed:

Frequency target of 1.6 GHz

42.6 GB/s DDR3 bandwidth (1.333 GHz DDR3)

One I/O link at 2.0 GB/s

10 intra-rack interprocessor links each at 2.0 GB/s

Cache Hierarchy:

32 MB shared L2 cache

5D torus in compute nodes

Bisection bandwidth of 65 TB/s (26 PF/s) / 49 TB/s (20 PF/s); for comparison, BG/L at LLNL is 0.7 TB/s

NewTech: Wakeup Unit
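
The BlueGene/Q peak figures listed above can be checked with a short calculation. The sketch below assumes 8 floating-point operations per cycle per core (a 4-wide quad floating-point unit with fused multiply-add) and 1024 nodes per rack; neither number is stated in the notes above, so both are assumptions. The function name `peak_gflops` is hypothetical.

```c
#include <assert.h>

/* Peak GFLOP/s = cores * clock (GHz) * floating-point operations
   per cycle per core. */
double peak_gflops(int cores, double clock_ghz, int flops_per_cycle) {
    assert(cores > 0 && clock_ghz > 0.0 && flops_per_cycle > 0);
    return cores * clock_ghz * flops_per_cycle;
}
```

With 16 cores at 1.6 GHz and the assumed 8 operations per cycle, `peak_gflops(16, 1.6, 8)` gives about 204.8 GF per node. Multiplying by the assumed 1024 nodes per rack gives roughly 209.7 TF, in line with the 209 TF/rack figure.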